In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler
from sentence_transformers import SentenceTransformer
import rootutils
import warnings

warnings.filterwarnings("ignore")

root = rootutils.setup_root(search_from=".", indicator=".git")

DATA_DIR = root / "data"

Load the Split Data¶

Store df['content'] in the variable texts as a list, and df['label'] in labels.

In [2]:
# Load the data
df_labeled = pd.read_parquet(DATA_DIR / "labeled.parquet")
df_unlabeled = pd.read_parquet(DATA_DIR / "unlabeled.parquet")
df = pd.concat([df_labeled, df_unlabeled], ignore_index=True)

texts = df["content"].tolist()
labels = df["label"].tolist()

Embeddings¶

Here I'm using the all-MiniLM-L6-v2 SentenceTransformers model to generate embeddings. all-MiniLM-L6-v2 is a practical and efficient choice for this sentiment analysis task.

  • Compact size: With only ~22M parameters, it runs fast even on CPUs and low-resource machines. It produces dense 384-dimensional embeddings; this relatively small vector size contributes to the model's speed and low memory usage.
  • Strong performance: Despite its small size, it achieves competitive results on semantic similarity, clustering, and retrieval tasks.
  • Pretrained and ready to use: It's available through SentenceTransformers, optimized for sentence embeddings out of the box, with no need for further training.
  • Fast inference: Its low latency and small memory footprint make it ideal for real-time or large-scale applications (e.g., search, RAG, semantic search).
In [3]:
# Generate embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)

Similarity analysis¶

First I will calculate the cosine similarity between all positive pairs, all negative pairs, all pairs overall, and cross pairs (positive vs. negative), and see how the similarities are distributed. The distributions should indicate whether similarity-based methods such as KNN are likely to work on these embeddings.

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

# split by label
labels_arr = np.array(labels)
emb_positive = embeddings[labels_arr == 1]
emb_negative = embeddings[labels_arr == 0]

# helper to flatten pairwise similarities
def pairwise_sims(embs):
    if embs.shape[0] < 2:
        return []                             # no pairs if fewer than 2 items
    S = cosine_similarity(embs)               # full (n × n) matrix
    # keep only the upper triangle, excluding the diagonal
    i, j = np.triu_indices_from(S, k=1)
    return S[i, j].tolist()

# compute the three intra-group lists
pos_sims = pairwise_sims(emb_positive)
neg_sims = pairwise_sims(emb_negative)
all_sims = pairwise_sims(embeddings)

cross_sim_matrix = cosine_similarity(emb_positive, emb_negative)  # shape: (n_pos, n_neg)

# flatten
cross_sims = cross_sim_matrix.flatten().tolist()

df_sims = pd.DataFrame({
    "similarity": pos_sims + neg_sims + all_sims + cross_sims,
    "group": (["positive"] * len(pos_sims)) +
             (["negative"] * len(neg_sims)) +
             (["all"] * len(all_sims)) +
             (["cross"] * len(cross_sims))
})

print(df_sims.groupby("group")["similarity"].describe())
              count      mean       std       min       25%       50%  \
group                                                                   
all       5118400.0  0.106972  0.115990 -0.272383  0.025047  0.086888   
cross     2560000.0  0.100572  0.113953 -0.272383  0.020286  0.080903   
negative  1279200.0  0.103811  0.116051 -0.255295  0.022577  0.082938   
positive  1279200.0  0.122942  0.118450 -0.261645  0.038398  0.103329   

               75%       max  
group                         
all       0.169401  0.816704  
cross     0.161701  0.810774  
negative  0.164340  0.805941  
positive  0.188595  0.816704  
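As a quick sanity check on the `pairwise_sims` helper, here is a toy example with known geometry (synthetic vectors, not part of the dataset):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_sims(embs):
    # flatten the upper triangle (excluding the diagonal) of the similarity matrix
    if embs.shape[0] < 2:
        return []
    S = cosine_similarity(embs)
    i, j = np.triu_indices_from(S, k=1)
    return S[i, j].tolist()

# three toy vectors: the first two are identical, the third is orthogonal to both
toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(pairwise_sims(toy))  # pairs (0,1), (0,2), (1,2) -> [1.0, 0.0, 0.0]
```

The upper-triangle indexing guarantees each unordered pair is counted exactly once and self-similarities (the all-ones diagonal) are excluded, so the statistics above are not inflated.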
In [5]:
import seaborn as sns
import matplotlib.pyplot as plt


plt.figure(figsize=(10, 6))
sns.kdeplot(data=df_sims, x="similarity", hue="group", fill=True)
plt.title("Distribution of Cosine Similarities (Intra- and Cross-Class)")
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: KDE plot of the four similarity distributions]

Cosine Similarity Distribution Analysis¶

The plot shows the distribution of cosine similarities for different pairings of embedding vectors, grouped as follows:

  • positive: Similarities between pairs of embeddings with label = 1.
  • negative: Similarities between pairs of embeddings with label = 0.
  • all: Similarities between all possible pairs within the dataset.
  • cross: Similarities between positive and negative pairs (inter-class).

Key Insights:¶

  • The "positive" and "negative" curves are similar, indicating that intra-class similarities (within each label) are not drastically different.
  • The "cross" similarity distribution overlaps significantly with both "positive" and "negative", but peaks slightly lower — suggesting that cross-class similarities are generally lower than intra-class ones.
  • The "all" curve has the highest peak and widest base, as it mixes both intra- and inter-class pairs, capturing the full variability.

Implications for KNN or Similar Methods:¶

  • Since positive and negative pairs show considerable overlap in similarity, KNN may struggle unless embeddings are further refined to separate the classes.
  • However, the slight separation in means suggests that distance-based methods like KNN or thresholding might still be viable with tuning.
  • Consider using this insight to set a similarity threshold or to weigh neighbors differently depending on their class similarity profile.
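The KNN idea can be sketched on synthetic data (a stand-in for the real embeddings; the blob means and k=15 here are illustrative assumptions, not tuned values):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
# two synthetic "classes" pointing in roughly opposite directions,
# standing in for the real sentence embeddings
X = np.vstack([rng.normal(1.0, 1.0, (200, 16)), rng.normal(-1.0, 1.0, (200, 16))])
y = np.array([1] * 200 + [0] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# cosine distance mirrors the similarity analysis above
knn = KNeighborsClassifier(n_neighbors=15, metric="cosine")
knn.fit(X_tr, y_tr)
print(f"test accuracy: {knn.score(X_te, y_te):.2f}")
```

With `metric="cosine"`, neighbors are ranked exactly by the similarity measure analyzed above, so the degree of overlap in those distributions directly bounds how well this classifier can do on the real data.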
In [6]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_sims, x="group", y="similarity")
plt.title("Boxplot of Cosine Similarities")
plt.xlabel("Group")
plt.ylabel("Cosine Similarity")
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()
[Figure: boxplot of the four similarity groups]
In [7]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_sims, x="group", y="similarity", showfliers=False)
plt.title("Boxplot of Cosine Similarities (without outliers)")
plt.xlabel("Group")
plt.ylabel("Cosine Similarity")
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()
[Figure: boxplot of the four similarity groups, outliers hidden]
In [8]:
s_pos = pd.Series(pos_sims, name="positive")
s_neg = pd.Series(neg_sims, name="negative")
s_all = pd.Series(all_sims, name="all")

stats = pd.concat(
    [s_pos.describe(), s_neg.describe(), s_all.describe()],
    axis=1
)

print(stats)
           positive      negative           all
count  1.279200e+06  1.279200e+06  5.118400e+06
mean   1.229423e-01  1.038110e-01  1.069725e-01
std    1.184498e-01  1.160509e-01  1.159905e-01
min   -2.616448e-01 -2.552953e-01 -2.723833e-01
25%    3.839753e-02  2.257734e-02  2.504674e-02
50%    1.033288e-01  8.293789e-02  8.688784e-02
75%    1.885946e-01  1.643402e-01  1.694011e-01
max    8.167036e-01  8.059405e-01  8.167036e-01

UMAP¶

In [9]:
# UMAP
scaler = StandardScaler()
emb_scaled = scaler.fit_transform(embeddings)


# name the instance "reducer" to avoid shadowing the umap package
reducer = UMAP(n_components=2, n_neighbors=15, random_state=42)
umap_emb = reducer.fit_transform(emb_scaled)
In [11]:
import plotly.express as px

# Create a DataFrame for plotting (separate name so the original df is not overwritten)
df_plot = pd.DataFrame({
    'UMAP1': umap_emb[:, 0],
    'UMAP2': umap_emb[:, 1],
    'Label': ['Positive' if l == 1 else 'Negative' for l in labels],
    'Text': texts
})

# Create interactive scatter plot
fig = px.scatter(
    df_plot,
    x='UMAP1',
    y='UMAP2',
    color='Label',
    hover_data=['Text'],
    title='UMAP Projection of Sentence Embeddings on the full dataset',
    color_discrete_map={'Positive': '#ff7f0e', 'Negative': '#1f77b4'},
    template='plotly_white'
)

fig.update_traces(marker=dict(size=8))
fig.update_layout(
    title_x=0.5,
    title_font_size=20,
    showlegend=True,
    width=1000,
    height=800
)

fig.show()

UMAP Projection of Sentence Embeddings¶

This UMAP plot shows the 2D projection of my sentence embeddings, with each point representing a sentence from the dataset.

  • Orange points correspond to the Positive label.
  • Blue points correspond to the Negative label.

Observations:¶

  • The points form several dense clusters in different regions of the space.
  • There is overlap between Positive and Negative points across much of the space.
  • A distinct cluster appears on the left side, mostly populated by Positive points.
  • Other regions contain a mix of both labels, with no clearly separated areas.
  • The overall distribution is continuous, without strong class-wise separation.
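One way to quantify this lack of separation (shown here on synthetic 2D points rather than the actual projection) is the silhouette score with respect to the class labels; values near 0 indicate heavy class mixing:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
y = np.array([0] * 150 + [1] * 150)

# heavily overlapping clusters vs. well-separated ones
overlapping = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(0.5, 1.0, (150, 2))])
separated = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(6.0, 1.0, (150, 2))])

print(f"overlapping: {silhouette_score(overlapping, y):.3f}")
print(f"separated:   {silhouette_score(separated, y):.3f}")
```

Applied to `umap_emb` and the real labels, a score close to 0 would back up the visual impression that the classes are not linearly separable in the projection.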

UMAP Projection of Sentence Embeddings on a 400-Sample Subset¶

UMAP Explanation